Data science has become one of the most widely applied disciplines, used in fields such as healthcare, finance, and transportation. In this notebook, we explore the Stack Overflow survey data to understand the data science discipline and the developers who work in it. The analysis covers:
- Understand Data Scientist Developers
- Salary Expectations for data scientists
- Most popular languages, databases, and platforms among data scientists
About the dataset: The dataset we will use is the Stack Overflow survey data from 2024. It contains information about developers, their job roles, and the technologies they use. We will focus on the data scientists in this dataset. Find the dataset here: https://survey.stackoverflow.co/
import pandas as pd
import matplotlib.pyplot as plt
import plotly.io as pio
pio.templates["custom_white"] = pio.templates["plotly"]
pio.templates["custom_white"]["layout"]["paper_bgcolor"] = "white"
pio.templates.default = "custom_white"
df = pd.read_csv("data/survey_results_public.csv")
df
| | ResponseId | MainBranch | Age | Employment | RemoteWork | Check | CodingActivities | EdLevel | LearnCode | LearnCodeOnline | ... | JobSatPoints_6 | JobSatPoints_7 | JobSatPoints_8 | JobSatPoints_9 | JobSatPoints_10 | JobSatPoints_11 | SurveyLength | SurveyEase | ConvertedCompYearly | JobSat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I am a developer by profession | Under 18 years old | Employed, full-time | Remote | Apples | Hobby | Primary/elementary school | Books / Physical media | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2 | I am a developer by profession | 35-44 years old | Employed, full-time | Remote | Apples | Hobby;Contribute to open-source projects;Other... | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
| 2 | 3 | I am a developer by profession | 45-54 years old | Employed, full-time | Remote | Apples | Hobby;Contribute to open-source projects;Other... | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | ... | NaN | NaN | NaN | NaN | NaN | NaN | Appropriate in length | Easy | NaN | NaN |
| 3 | 4 | I am learning to code | 18-24 years old | Student, full-time | NaN | Apples | NaN | Some college/university study without earning ... | Other online resources (e.g., videos, blogs, f... | Stack Overflow;How-to videos;Interactive tutorial | ... | NaN | NaN | NaN | NaN | NaN | NaN | Too long | Easy | NaN | NaN |
| 4 | 5 | I am a developer by profession | 18-24 years old | Student, full-time | NaN | Apples | NaN | Secondary school (e.g. American high school, G... | Other online resources (e.g., videos, blogs, f... | Technical documentation;Blogs;Written Tutorial... | ... | NaN | NaN | NaN | NaN | NaN | NaN | Too short | Easy | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65432 | 65433 | I am a developer by profession | 18-24 years old | Employed, full-time | Remote | Apples | Hobby;School or academic work | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | On the job training;School (i.e., University, ... | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 65433 | 65434 | I am a developer by profession | 25-34 years old | Employed, full-time | Remote | Apples | Hobby;Contribute to open-source projects | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 65434 | 65435 | I am a developer by profession | 25-34 years old | Employed, full-time | In-person | Apples | Hobby | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | Other online resources (e.g., videos, blogs, f... | Technical documentation;Stack Overflow;Social ... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 65435 | 65436 | I am a developer by profession | 18-24 years old | Employed, full-time | Hybrid (some remote, some in-person) | Apples | Hobby;Contribute to open-source projects;Profe... | Secondary school (e.g. American high school, G... | On the job training;Other online resources (e.... | Technical documentation;Blogs;Written Tutorial... | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
| 65436 | 65437 | I code primarily as a hobby | 18-24 years old | Student, full-time | NaN | Apples | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
65437 rows × 114 columns
df.columns
Index(['ResponseId', 'MainBranch', 'Age', 'Employment', 'RemoteWork', 'Check',
'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
...
'JobSatPoints_6', 'JobSatPoints_7', 'JobSatPoints_8', 'JobSatPoints_9',
'JobSatPoints_10', 'JobSatPoints_11', 'SurveyLength', 'SurveyEase',
'ConvertedCompYearly', 'JobSat'],
dtype='object', length=114)
selected_columns = [
'Age',
'Employment',
'RemoteWork',
'EdLevel',
'LearnCode',
'LearnCodeOnline',
'YearsCode',
'YearsCodePro',
'OrgSize',
'Country',
'CompTotal',
'DevType',
'LanguageHaveWorkedWith',
'LanguageWantToWorkWith',
'LanguageAdmired',
'DatabaseHaveWorkedWith',
'DatabaseWantToWorkWith',
'DatabaseAdmired',
'PlatformHaveWorkedWith',
'PlatformWantToWorkWith',
'PlatformAdmired',
'WebframeHaveWorkedWith',
'WebframeWantToWorkWith',
'WebframeAdmired',
'EmbeddedHaveWorkedWith',
'EmbeddedWantToWorkWith',
'EmbeddedAdmired',
'MiscTechHaveWorkedWith',
'MiscTechWantToWorkWith',
'MiscTechAdmired',
'ToolsTechHaveWorkedWith',
'ToolsTechWantToWorkWith',
'ToolsTechAdmired',
'NEWCollabToolsHaveWorkedWith',
'NEWCollabToolsWantToWorkWith',
'NEWCollabToolsAdmired',
'OpSysPersonal use',
'OpSysProfessional use',
'OfficeStackAsyncHaveWorkedWith',
'OfficeStackAsyncWantToWorkWith',
'OfficeStackAsyncAdmired',
'OfficeStackSyncHaveWorkedWith',
'OfficeStackSyncWantToWorkWith',
'OfficeStackSyncAdmired',
'AISearchDevHaveWorkedWith',
'AISearchDevWantToWorkWith',
'AISearchDevAdmired',
'AISelect',
'AISent',
'AIBen',
'AIAcc',
'AIComplex',
'AIToolCurrently Using',
'AIToolInterested in Using',
'AIToolNot interested in Using',
'AINextMuch more integrated',
'AINextNo change',
'AINextMore integrated',
'AINextLess integrated',
'AINextMuch less integrated',
'AIThreat',
'AIEthics',
'AIChallenges',
'Industry',
'WorkExp',
'JobSat',
]
df = df[selected_columns]
df
| | Age | Employment | RemoteWork | EdLevel | LearnCode | LearnCodeOnline | YearsCode | YearsCodePro | OrgSize | Country | ... | AINextNo change | AINextMore integrated | AINextLess integrated | AINextMuch less integrated | AIThreat | AIEthics | AIChallenges | Industry | WorkExp | JobSat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Under 18 years old | Employed, full-time | Remote | Primary/elementary school | Books / Physical media | NaN | NaN | NaN | NaN | United States of America | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 35-44 years old | Employed, full-time | Remote | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | 20 | 17 | NaN | United Kingdom of Great Britain and Northern I... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 17.0 | NaN |
| 2 | 45-54 years old | Employed, full-time | Remote | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | 37 | 27 | NaN | United Kingdom of Great Britain and Northern I... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 18-24 years old | Student, full-time | NaN | Some college/university study without earning ... | Other online resources (e.g., videos, blogs, f... | Stack Overflow;How-to videos;Interactive tutorial | 4 | NaN | NaN | Canada | ... | NaN | NaN | NaN | NaN | No | Circulating misinformation or disinformation;M... | Don’t trust the output or answers | NaN | NaN | NaN |
| 4 | 18-24 years old | Student, full-time | NaN | Secondary school (e.g. American high school, G... | Other online resources (e.g., videos, blogs, f... | Technical documentation;Blogs;Written Tutorial... | 9 | NaN | NaN | Norway | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65432 | 18-24 years old | Employed, full-time | Remote | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | On the job training;School (i.e., University, ... | NaN | 5 | 3 | 2 to 9 employees | NaN | ... | NaN | Learning about a codebase;Project planning;Doc... | NaN | NaN | No | Circulating misinformation or disinformation | AI tools lack context of codebase, internal a... | NaN | NaN | NaN |
| 65433 | 25-34 years old | Employed, full-time | Remote | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 65434 | 25-34 years old | Employed, full-time | In-person | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | Other online resources (e.g., videos, blogs, f... | Technical documentation;Stack Overflow;Social ... | 9 | 5 | 1,000 to 4,999 employees | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 65435 | 18-24 years old | Employed, full-time | Hybrid (some remote, some in-person) | Secondary school (e.g. American high school, G... | On the job training;Other online resources (e.... | Technical documentation;Blogs;Written Tutorial... | 5 | 2 | 20 to 99 employees | Germany | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN |
| 65436 | 18-24 years old | Student, full-time | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
65437 rows × 66 columns
import plotly.express as px
initial_dev_type_values = df["DevType"].value_counts()
sorted_values = initial_dev_type_values.sort_values(ascending=True)
fig = px.bar(sorted_values,
orientation='h',
title='Initial Developer Roles',
labels={'value': 'Number of Respondents', 'index': 'Developer Role'},
color_discrete_sequence=['skyblue'])
fig.update_layout(xaxis_title='Number of Respondents', yaxis_title='Developer Role')
fig.show()
1. Understand Data Scientist Developers
To see what data scientists have in common, we filter the DevType column to keep only specific roles: Data or business analyst; Data scientist or machine learning specialist; Data engineer; Developer, AI; and Scientist. We then create a pie chart to visualize how these roles are distributed.
Pie Chart
The pie chart shows that the most common role among data scientists is Data engineer, followed by Data scientist or machine learning specialist, ...
import plotly.graph_objects as go
import plotly.express as px
disciplines = [
"Data or business analyst",
"Data scientist or machine learning specialist",
"Data engineer",
"Developer, AI",
"Scientist"
]
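# .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
df_ai = df[df["DevType"].str.contains("|".join(disciplines), na=False)].copy()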
devtype_counts = df_ai["DevType"].value_counts()
fig = go.Figure(data=[go.Pie(labels=devtype_counts.index,
values=devtype_counts.values,
textinfo='percent+label',
marker=dict(colors=px.colors.qualitative.Set3),
hole=.3)])
fig.update_layout(title_text='Developer Roles')
fig.show()
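The role filter above builds a regular expression by joining the role names with `|`. The following is a minimal sketch of that idea on a toy frame with hypothetical rows. `re.escape` is an addition made here as a precaution (it is not part of the original cell) so that any regex metacharacters in a role name would be matched literally.

```python
import re
import pandas as pd

# Hypothetical sample rows standing in for the survey's DevType column.
toy = pd.DataFrame({
    "DevType": [
        "Data engineer",
        "Developer, full-stack",
        "Data scientist or machine learning specialist",
        None,
        "Developer, AI",
    ]
})

roles = ["Data engineer", "Data scientist or machine learning specialist", "Developer, AI"]

# Escape each role so that characters such as "+" or "(" would be matched literally.
pattern = "|".join(re.escape(role) for role in roles)

# na=False treats missing DevType values as non-matches.
toy_ai = toy[toy["DevType"].str.contains(pattern, na=False)].copy()
print(toy_ai["DevType"].tolist())
```

Because `str.contains` looks for the pattern anywhere in the string, a short role name also matches longer answers that contain it.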
Industry
We'll create a pie chart to visualize the distribution of industries where data scientists work. This will help us understand the industries that are most likely to employ data scientists.
import plotly.graph_objects as go
industry_counts = df_ai['Industry'].value_counts()
fig = go.Figure(data=[go.Pie(labels=industry_counts.index,
values=industry_counts.values,
textinfo='percent+label',
hole=.3)])
fig.update_layout(title_text='Distribution of Industries')
fig.show()
Now we'll map countries to continents; this mapping will be used later to visualize the distribution of data scientists by continent.
country_to_continent = {
# Africa
'Algeria': 'Africa',
'Angola': 'Africa',
'Benin': 'Africa',
'Botswana': 'Africa',
'Burkina Faso': 'Africa',
'Burundi': 'Africa',
'Cabo Verde': 'Africa',
'Cameroon': 'Africa',
'Central African Republic': 'Africa',
'Chad': 'Africa',
'Comoros': 'Africa',
'Congo': 'Africa',
'Djibouti': 'Africa',
'Egypt': 'Africa',
'Equatorial Guinea': 'Africa',
'Eritrea': 'Africa',
'Eswatini': 'Africa',
'Ethiopia': 'Africa',
'Gabon': 'Africa',
'Gambia': 'Africa',
'Ghana': 'Africa',
'Guinea': 'Africa',
'Guinea-Bissau': 'Africa',
'Ivory Coast': 'Africa',
'Kenya': 'Africa',
'Lesotho': 'Africa',
'Liberia': 'Africa',
'Libya': 'Africa',
'Madagascar': 'Africa',
'Malawi': 'Africa',
'Mali': 'Africa',
'Mauritania': 'Africa',
'Mauritius': 'Africa',
'Morocco': 'Africa',
'Mozambique': 'Africa',
'Namibia': 'Africa',
'Niger': 'Africa',
'Nigeria': 'Africa',
'Rwanda': 'Africa',
'Sao Tome and Principe': 'Africa',
'Senegal': 'Africa',
'Seychelles': 'Africa',
'Sierra Leone': 'Africa',
'Somalia': 'Africa',
'South Africa': 'Africa',
'South Sudan': 'Africa',
'Sudan': 'Africa',
'Tanzania': 'Africa',
'Togo': 'Africa',
'Tunisia': 'Africa',
'Uganda': 'Africa',
'Zambia': 'Africa',
'Zimbabwe': 'Africa',
# Asia
'Afghanistan': 'Asia',
'Armenia': 'Asia',
'Azerbaijan': 'Asia',
'Bahrain': 'Asia',
'Bangladesh': 'Asia',
'Bhutan': 'Asia',
'Brunei': 'Asia',
'Cambodia': 'Asia',
'China': 'Asia',
'Cyprus': 'Asia',
'Georgia': 'Asia',
'India': 'Asia',
'Indonesia': 'Asia',
'Iran': 'Asia',
'Iraq': 'Asia',
'Israel': 'Asia',
'Japan': 'Asia',
'Jordan': 'Asia',
'Kazakhstan': 'Asia',
'Kuwait': 'Asia',
'Kyrgyzstan': 'Asia',
'Laos': 'Asia',
'Lebanon': 'Asia',
'Malaysia': 'Asia',
'Maldives': 'Asia',
'Mongolia': 'Asia',
'Myanmar': 'Asia',
'Nepal': 'Asia',
'North Korea': 'Asia',
'Oman': 'Asia',
'Pakistan': 'Asia',
'Palestine': 'Asia',
'Philippines': 'Asia',
'Qatar': 'Asia',
'Saudi Arabia': 'Asia',
'Singapore': 'Asia',
'South Korea': 'Asia',
'Sri Lanka': 'Asia',
'Syria': 'Asia',
'Taiwan': 'Asia',
'Tajikistan': 'Asia',
'Thailand': 'Asia',
'Timor-Leste': 'Asia',
'Turkey': 'Asia',
'Turkmenistan': 'Asia',
'United Arab Emirates': 'Asia',
'Uzbekistan': 'Asia',
'Vietnam': 'Asia',
'Yemen': 'Asia',
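# Europe (Armenia, Azerbaijan, Cyprus, Georgia, and Kazakhstan also appear under Asia above; the later entry wins, so they map to Europe)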
'Albania': 'Europe',
'Andorra': 'Europe',
'Armenia': 'Europe',
'Austria': 'Europe',
'Azerbaijan': 'Europe',
'Belarus': 'Europe',
'Belgium': 'Europe',
'Bosnia and Herzegovina': 'Europe',
'Bulgaria': 'Europe',
'Croatia': 'Europe',
'Cyprus': 'Europe',
'Czech Republic': 'Europe',
'Denmark': 'Europe',
'Estonia': 'Europe',
'Finland': 'Europe',
'France': 'Europe',
'Georgia': 'Europe',
'Germany': 'Europe',
'Greece': 'Europe',
'Hungary': 'Europe',
'Iceland': 'Europe',
'Ireland': 'Europe',
'Italy': 'Europe',
'Kazakhstan': 'Europe',
'Kosovo': 'Europe',
'Latvia': 'Europe',
'Liechtenstein': 'Europe',
'Lithuania': 'Europe',
'Luxembourg': 'Europe',
'Malta': 'Europe',
'Moldova': 'Europe',
'Monaco': 'Europe',
'Montenegro': 'Europe',
'Netherlands': 'Europe',
'North Macedonia': 'Europe',
'Norway': 'Europe',
'Poland': 'Europe',
'Portugal': 'Europe',
'Romania': 'Europe',
'Russian Federation': 'Europe',
'San Marino': 'Europe',
'Serbia': 'Europe',
'Slovakia': 'Europe',
'Slovenia': 'Europe',
'Spain': 'Europe',
'Sweden': 'Europe',
'Switzerland': 'Europe',
'Ukraine': 'Europe',
'United Kingdom': 'Europe',
'Vatican City': 'Europe',
# North America
'Antigua and Barbuda': 'North America',
'Bahamas': 'North America',
'Barbados': 'North America',
'Belize': 'North America',
'Canada': 'North America',
'Costa Rica': 'North America',
'Cuba': 'North America',
'Dominica': 'North America',
'Dominican Republic': 'North America',
'El Salvador': 'North America',
'Grenada': 'North America',
'Guatemala': 'North America',
'Haiti': 'North America',
'Honduras': 'North America',
'Jamaica': 'North America',
'Mexico': 'North America',
'Nicaragua': 'North America',
'Panama': 'North America',
'Saint Kitts and Nevis': 'North America',
'Saint Lucia': 'North America',
'Saint Vincent and the Grenadines': 'North America',
'Trinidad and Tobago': 'North America',
'United States': 'North America',
# South America
'Argentina': 'South America',
'Bolivia': 'South America',
'Brazil': 'South America',
'Chile': 'South America',
'Colombia': 'South America',
'Ecuador': 'South America',
'Guyana': 'South America',
'Paraguay': 'South America',
'Peru': 'South America',
'Suriname': 'South America',
'Uruguay': 'South America',
'Venezuela': 'South America',
# Oceania
'Australia': 'Oceania',
'Fiji': 'Oceania',
'Kiribati': 'Oceania',
'Marshall Islands': 'Oceania',
'Micronesia': 'Oceania',
'Nauru': 'Oceania',
'New Zealand': 'Oceania',
'Palau': 'Oceania',
'Papua New Guinea': 'Oceania',
'Samoa': 'Oceania',
'Solomon Islands': 'Oceania',
'Tonga': 'Oceania',
'Tuvalu': 'Oceania',
'Vanuatu': 'Oceania',
}
country_to_continent.update({
'United States of America': 'North America',
'United Kingdom of Great Britain and Northern Ireland': 'Europe',
'Iran, Islamic Republic of...': 'Asia',
'Viet Nam': 'Asia',
'Hong Kong (S.A.R.)': 'Asia',
'United Republic of Tanzania': 'Africa',
'Syrian Arab Republic': 'Asia',
'Republic of Moldova': 'Europe',
'Republic of Korea': 'Asia',
'Isle of Man': 'Europe',
'Venezuela, Bolivarian Republic of...': 'South America',
'Congo, Republic of the...': 'Africa',
'Nomadic': 'Other'
})
df_ai['Continent'] = df_ai['Country'].map(country_to_continent).fillna('Other')
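Any country name missing from the dictionary silently falls into 'Other', so it can help to list the unmapped names before trusting the continent totals. The sketch below shows such a check on hypothetical data; `toy_map` and `toy_countries` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Illustrative subset of a country-to-continent mapping.
toy_map = {"Germany": "Europe", "Canada": "North America", "India": "Asia"}

# Hypothetical country values, including a name absent from the mapping and a missing value.
toy_countries = pd.Series(["Germany", "Canada", "Czechia", None, "India"])

# Same pattern as the notebook: unmapped or missing countries become "Other".
continents = toy_countries.map(toy_map).fillna("Other")

# Names that were present in the data but had no entry in the mapping.
unmapped = sorted(set(toy_countries.dropna()) - set(toy_map))

print(continents.tolist())
print(unmapped)
```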
What expertise do data scientists have? We measure it along the dimensions below.
Professional Experience
We will group the years of professional coding experience (YearsCodePro) into ranges to understand the distribution of experience levels among data scientists. This will help us identify the experience levels that are most common in the field.
Insight:
- Most data scientists have between 2 and 5 years of professional coding experience.
import plotly.express as px
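def categorize_years_code(years):
    # The survey stores two answers as text rather than as digits.
    if years == "Less than 1 year":
        return "0-1"
    if years == "More than 50 years":
        return "10+"
    try:
        years = int(years)
    except (ValueError, TypeError):
        # Missing or unexpected values are passed through unchanged.
        return years
    if years == 1:
        return "1"
    elif 2 <= years <= 5:
        return "2-5"
    elif 6 <= years <= 10:
        return "6-10"
    else:
        return "10+"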
df_ai.loc[:, 'YearsCodeGroup'] = df_ai['YearsCodePro'].apply(categorize_years_code)
years_code_counts = df_ai['YearsCodeGroup'].value_counts().sort_values()
fig = px.bar(years_code_counts,
orientation='h',
title='Coding Experience',
labels={'value': 'Number of Respondents', 'index': 'Years of Coding Experience'},
color_discrete_sequence=['skyblue'])
fig.update_layout(xaxis_title='Number of Respondents', yaxis_title='Years of Coding Experience')
fig.show()
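The same grouping can also be expressed with `pd.cut` once the answers are numeric. This is a sketch of an alternative, not the notebook's method; the sample values are hypothetical, and the stand-in numbers 0 and 51 for the two text answers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical YearsCodePro-style values: digit strings plus the two text answers.
years_raw = pd.Series(["Less than 1 year", "1", "3", "10", "More than 50 years", None])

# Map the text answers to numbers; 0 and 51 are stand-ins chosen for this sketch.
text_answers = {"Less than 1 year": 0, "More than 50 years": 51}
years_num = pd.to_numeric(years_raw.map(lambda v: text_answers.get(v, v)), errors="coerce")

# Bin edges sit halfway between whole years so each integer falls in exactly one bin.
bins = [-0.5, 0.5, 1.5, 5.5, 10.5, np.inf]
labels = ["0-1", "1", "2-5", "6-10", "10+"]
groups = pd.cut(years_num, bins=bins, labels=labels)

# reindex keeps the groups in their natural order instead of sorting by count.
print(groups.value_counts().reindex(labels))
```

Since `pd.cut` returns an ordered categorical, the groups keep their natural order in later charts, and missing answers stay missing.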
Age
We will create a horizontal bar chart to visualize the distribution of ages among data scientists. This will help us understand the age groups that are most common in the field.
Insight:
- Most data scientists are between the ages of 25-34.
- The age distribution is skewed towards younger developers, with fewer developers in the older age groups.
import plotly.express as px
age_distribution = df_ai['Age'].value_counts()
age_distribution = age_distribution.sort_values()
fig = px.bar(age_distribution,
orientation='h',
title='Ages of data scientists',
labels={'value': 'Number of Respondents', 'index': 'Age Group'},
color_discrete_sequence=['skyblue'])
fig.update_layout(xaxis_title='Number of Respondents', yaxis_title='Age Group')
fig.show()
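Sorting by count puts the brackets in frequency order. To read the shape of the age distribution, the brackets can instead be kept in chronological order. A minimal sketch, assuming the bracket labels seen in the preview table and made-up responses:

```python
import pandas as pd

# Bracket labels as they appear in the survey preview; the responses are made up.
age_order = ["Under 18 years old", "18-24 years old", "25-34 years old",
             "35-44 years old", "45-54 years old"]
ages = pd.Series(["25-34 years old", "18-24 years old", "25-34 years old",
                  "45-54 years old", "25-34 years old", "35-44 years old"])

# reindex keeps the chronological bracket order and gives brackets with no answers a count of 0.
age_counts = ages.value_counts().reindex(age_order, fill_value=0)
print(age_counts)
```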
Employment Status
We will group the employment status of data scientists into four simple categories:
- Full Time
- Partial
- Freelancer
- Non Employed
Insight:
- Most data scientists are employed full-time, followed by freelancers and part-time workers.
- A small percentage of data scientists are not employed.
- Full-time employment is the typical arrangement in this discipline.
import plotly.graph_objects as go
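def categorize_employment(status):
    # Missing answers are not strings; treat them as not employed.
    if not isinstance(status, str):
        return "Non Employed"
    is_freelancer = "freelancer" in status or "self-employed" in status
    # Match "Employed, full-time" explicitly so that "Student, full-time" is not counted as a full-time job.
    if "Employed, full-time" in status and not is_freelancer:
        return "Full Time"
    elif "part-time" in status or "Student" in status:
        return "Partial"
    elif is_freelancer:
        return "Freelancer"
    else:
        return "Non Employed"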
df_ai.loc[:, 'EmploymentStatus'] = df_ai['Employment'].apply(categorize_employment)
employment_counts = df_ai['EmploymentStatus'].value_counts()
fig = go.Figure(data=[go.Pie(labels=employment_counts.index,
values=employment_counts.values,
textinfo='percent+label',
hole=.3)])
fig.update_layout(title_text='Employment Status')
fig.show()
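The categories above collapse each answer into a single label. If Employment follows the same semicolon-separated multi-select format as columns such as CodingActivities (an assumption in this sketch), the individual statuses can also be counted separately. The sample answers below are hypothetical.

```python
import pandas as pd

# Hypothetical answers; the semicolon-separated format is an assumption carried over
# from other multi-select columns in the survey.
employment = pd.Series([
    "Employed, full-time",
    "Employed, full-time;Student, part-time",
    "Independent contractor, freelancer, or self-employed",
    "Student, full-time",
])

# Split each answer into its individual statuses, then count each status.
status_counts = employment.str.split(";").explode().value_counts()
print(status_counts)
```

A respondent with two statuses contributes to two counts here, so the totals can exceed the number of respondents.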
Remote Work
COVID-19 has changed the way people work, with many companies adopting remote work policies. We will create a pie chart to visualize the distribution of remote work status.
Insight:
- The hybrid model is the most common work arrangement among data scientists.
- A significant percentage of data scientists work fully remotely.
- A small percentage of data scientists work fully in person.
import plotly.graph_objects as go
remote_work_counts = df_ai['RemoteWork'].value_counts()
fig = go.Figure(data=[go.Pie(labels=remote_work_counts.index,
values=remote_work_counts.values,
textinfo='percent',
hole=.3)])
fig.update_layout(title_text='Remote Work Status')
fig.show()
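The pie chart shows the shares visually; the same percentages can be printed as numbers to back statements such as "most common" or "a small percentage". A short sketch with made-up responses, using the three RemoteWork labels from the survey preview:

```python
import pandas as pd

# Made-up responses using the three RemoteWork labels from the survey preview.
hybrid = "Hybrid (some remote, some in-person)"
remote = pd.Series([hybrid, "Remote", hybrid, "In-person", "Remote", hybrid, None, "Remote", hybrid])

# normalize=True returns shares instead of counts; missing answers are excluded by default.
remote_share = remote.value_counts(normalize=True).mul(100).round(1)
print(remote_share)
```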
Education Level vs Coding Experience
We will create a stacked bar chart to visualize the distribution of education levels among data scientists, broken down by their coding experience. This will help us understand the relationship between education and experience levels in the field.
Insight:
- Most data scientists have a Bachelor's or Master's degree.
- The distribution of education levels is broadly consistent across experience levels.
- Junior developers are more likely to have a Bachelor's degree, while Senior developers are more likely to have a Master's degree.
- For data science students, a Bachelor's or Master's degree is the typical background for getting a first job as a Junior developer.
import plotly.express as px
import pandas as pd
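def categorize_experience(years):
    # Text answers from the survey are handled before the numeric conversion.
    if years in ["Less than 1 year", "0-1", "1"]:
        return "Junior"
    if years == "More than 50 years":
        return "Senior"
    try:
        years = int(years)
    except (ValueError, TypeError):
        # Missing values end up here.
        return "Other"
    if years <= 5:
        return "Junior"
    elif years <= 10:
        return "Semi Senior"
    else:
        return "Senior"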
df_ai.loc[:,'ExperienceLevel'] = df_ai['YearsCodePro'].apply(categorize_experience)
valid_categories = ['Junior', 'Semi Senior', 'Senior', 'Other']
df_ai.loc[:,'ExperienceLevel'] = pd.Categorical(df_ai['ExperienceLevel'], categories=valid_categories, ordered=True)
df_ai.loc[:,'EdLevelCopy'] = df_ai.loc[:,'EdLevel'].copy()
ed_level_map = {
"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)": "Master's",
"Bachelor’s degree (B.A., B.S., B.Eng., etc.)": "Bachelor's",
"Professional degree (JD, MD, Ph.D, Ed.D, etc.)": "Professional",
"Some college/university study without earning a degree": "Some College",
"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)": "Secondary",
"Associate degree (A.A., A.S., etc.)": "Associate",
"Something else": "Other",
"Primary/elementary school": "Primary"
}
df_ai.loc[:,'EdLevelCopy'] = df_ai.loc[:,'EdLevelCopy'].map(ed_level_map)
grouped_data = df_ai.groupby(['EdLevelCopy', 'ExperienceLevel'], observed=True).size().unstack(fill_value=0)
grouped_data = grouped_data.loc[grouped_data.sum(axis=1).sort_values(ascending=False).index]
fig = px.bar(grouped_data,
x=grouped_data.index,
y=grouped_data.columns,
title='Education Level vs Coding Experience',
labels={'value': 'Count', 'EdLevelCopy': 'Education Level'},
barmode='stack')
fig.update_traces(texttemplate='%{value}', textposition='inside')
fig.update_layout(xaxis_title='Education Level', yaxis_title='Count', legend_title_text='Experience Level')
fig.show()
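Stacked counts make it hard to compare groups of different sizes. To check whether Junior and Senior respondents really differ in education, the counts can be turned into shares within each experience level. A minimal sketch on hypothetical rows, reusing the notebook's column names:

```python
import pandas as pd

# Hypothetical respondents; labels follow the shortened names used in the notebook.
toy = pd.DataFrame({
    "ExperienceLevel": ["Junior"] * 4 + ["Senior"] * 4,
    "EdLevelCopy": ["Bachelor's", "Bachelor's", "Bachelor's", "Master's",
                    "Bachelor's", "Master's", "Master's", "Master's"],
})

# normalize="index" turns each experience level's row into shares that sum to 1.
ed_share = pd.crosstab(toy["ExperienceLevel"], toy["EdLevelCopy"], normalize="index")
print(ed_share)
```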
2. Salary Expectations for data scientists
Salary Distribution
We will create a box plot to visualize the distribution and skewness of salaries. This will help us understand the typical salary range for data scientists. To reduce the influence of extreme values, we filter out the top 10% of salaries before plotting. Note that CompTotal is reported in each respondent's own currency, so the values are not directly comparable across countries.
Insight:
- Our data is widely spread and has a long tail, indicating a wide range of salaries.
import plotly.graph_objects as go
total_compensation = df_ai['CompTotal'].dropna()
percentile_90 = total_compensation.quantile(0.90)
df_ai_filtered = df_ai[df_ai['CompTotal'] <= percentile_90]
fig = go.Figure()
fig.add_trace(go.Box(y=df_ai_filtered['CompTotal'], boxpoints='all', jitter=0.3, pointpos=-1.8))
fig.update_layout(title='Box Plot of CompTotal (Filtered Below 90th Percentile)',
yaxis_title='Compensation Total',
xaxis=dict(visible=False),
showlegend=False)
fig.show()
print("Original count:", len(total_compensation))
print("Filtered count:", len(df_ai_filtered))
print("90th percentile value:", percentile_90)
outliers_count = df_ai[df_ai['CompTotal'] > percentile_90].shape[0]
print("Number of values above the 90th percentile:", outliers_count)
Original count: 2186
Filtered count: 1977
90th percentile value: 1200000.0
Number of values above the 90th percentile: 209
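The percentile filter used above can be wrapped in a small helper so the cut-off is easy to change. This is a sketch under stated assumptions: `trim_above_quantile` is a hypothetical name and the compensation values are made up.

```python
import pandas as pd

def trim_above_quantile(frame: pd.DataFrame, column: str, q: float) -> pd.DataFrame:
    """Return rows whose `column` value is at or below its q-th quantile (NaN rows are dropped)."""
    cutoff = frame[column].quantile(q)
    return frame[frame[column] <= cutoff].copy()

# Made-up compensation values with one extreme entry and one missing value.
toy = pd.DataFrame({"CompTotal": [40_000, 55_000, 60_000, 72_000, 80_000,
                                  90_000, 95_000, 110_000, 130_000, 5_000_000, None]})

trimmed = trim_above_quantile(toy, "CompTotal", 0.90)
print(len(toy), len(trimmed), trimmed["CompTotal"].max())
```

The comparison with the cut-off is false for missing values, so rows without a compensation value are dropped together with the values above the cut-off.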
Normal Distribution
We will create a histogram of the salary data and fit a normal distribution curve to it.
Insight:
- The histogram of salary data is not normally distributed.
- The fitted normal curve does not match the histogram, which confirms this.
- The data is right-skewed, with a long tail on the right side of the distribution.
- For this reason, we will trim the outliers further to get a clearer view of the salary distribution.
import numpy as np
import seaborn as sns
import scipy.stats as stats
filtered_compensation = df_ai_filtered['CompTotal']
plt.figure(figsize=(10, 6))
sns.histplot(filtered_compensation, kde=False, bins=30, color='blue', stat='density')
# Fit a normal distribution to the data
mu, std = stats.norm.fit(filtered_compensation)
# Plot the normal distribution curve
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)
# The histogram uses stat='density', so the fitted PDF can be drawn on the same scale without rescaling
plt.plot(x, p, linewidth=2, color='red')
plt.title('Histogram of CompTotal with Fitted Normal Distribution')
plt.xlabel('Compensation Total')
plt.ylabel('Density')
plt.grid(True)
plt.show()
# Print the mean and standard deviation for reference
print(f"Mean: {mu}")
print(f"Standard Deviation: {std}")
Mean: 171242.17197774406
Standard Deviation: 215516.05052381053
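Besides comparing the histogram with the fitted curve by eye, right skew can be checked numerically. The sketch below uses synthetic log-normal values (not the survey data) and two simple indicators: the mean compared with the median, and the sample skewness.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (log-normal draws), used only to illustrate the check.
rng = np.random.default_rng(0)
salaries = pd.Series(rng.lognormal(mean=11, sigma=1.0, size=2_000))

# In a right-skewed sample the mean typically sits above the median and the skewness is positive.
print(f"mean:     {salaries.mean():,.0f}")
print(f"median:   {salaries.median():,.0f}")
print(f"skewness: {salaries.skew():.2f}")
```

A skewness near zero would be consistent with a symmetric distribution such as the normal; a clearly positive value points to a long right tail.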
Salary by Experience Level
We create a box plot of salary by experience level to check whether the skewness is related to experience. The skewness is present in all three experience levels.
Insight:
- For this reason, we will trim the outliers further by removing the top 25% of the data to get a clearer view of the salary distribution.
import plotly.express as px
fig = px.box(df_ai_filtered,
x='ExperienceLevel',
y='CompTotal',
category_orders={'ExperienceLevel': valid_categories},
title='Boxplot of Compensation by Experience Level',
labels={'ExperienceLevel': 'Experience Level', 'CompTotal': 'Compensation Total'})
fig.show()
We now remove the top 25% of the data to get a clearer view of the salary distribution.
As the box plot shows, the data now appears roughly normally distributed with low skewness, so we can plot its distribution next.
import plotly.graph_objects as go
total_compensation = df_ai['CompTotal'].dropna()
percentile_75 = total_compensation.quantile(0.75)
df_ai_filtered_2 = df_ai[df_ai['CompTotal'] <= percentile_75]
fig = go.Figure()
fig.add_trace(go.Box(y=df_ai_filtered_2['CompTotal'], boxpoints='all', jitter=0.3, pointpos=-1.8))
fig.update_layout(title='Box Plot of CompTotal (Filtered Below 75th Percentile)',
yaxis_title='Compensation Total',
xaxis=dict(visible=False),
showlegend=False)
fig.show()
Normal Distribution
We will plot a histogram of the salary data and fit a normal distribution curve to it using scipy.stats.
Insight:
- After trimming, the histogram of salary data is approximately normally distributed.
- The fitted normal curve follows the histogram closely, which supports this.
- We will use this trimmed data for the remaining charts.
import numpy as np
import seaborn as sns
import scipy.stats as stats
filtered_compensation = df_ai_filtered_2['CompTotal']
plt.figure(figsize=(10, 6))
sns.histplot(filtered_compensation, kde=False, bins=30, color='blue', stat='density')
mu, std = stats.norm.fit(filtered_compensation)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)
plt.plot(x, p, linewidth=2, color='red')
plt.title('Histogram of CompTotal with Fitted Normal Distribution')
plt.xlabel('Compensation Total')
plt.ylabel('Density')
plt.grid(True)
plt.show()
print(f"Mean: {mu}")
print(f"Standard Deviation: {std}")
Mean: 92786.49115314215
Standard Deviation: 54162.009649499676
Salary vs Experience Level
We'll create a box plot to compare salary across experience levels.
Insight:
- As elsewhere in the tech industry, salary rises with experience, as the chart below shows.
- Junior devs are between USD 40000 and USD 100000
- Semi Senior devs are between USD 60000 and USD 140000
- Senior devs are between USD 70000 and USD 160000
import plotly.express as px
fig = px.box(df_ai_filtered_2,
x='ExperienceLevel',
y='CompTotal',
category_orders={'ExperienceLevel': valid_categories},
title='Boxplot of Compensation by Experience Level',
labels={'ExperienceLevel': 'Experience Level', 'CompTotal': 'Compensation Total'})
fig.show()
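Ranges like the ones listed above can be read off the box edges, or computed directly as quartiles per group. A minimal sketch with made-up values, reusing the notebook's column names:

```python
import pandas as pd

# Made-up compensation values for two experience levels.
toy = pd.DataFrame({
    "ExperienceLevel": ["Junior"] * 5 + ["Senior"] * 5,
    "CompTotal": [40_000, 50_000, 60_000, 70_000, 80_000,
                  80_000, 100_000, 120_000, 140_000, 160_000],
})

# Quartiles per group; the 0.25 and 0.75 columns roughly correspond to the edges of each box
# (plotting libraries may use a slightly different quartile method).
quartiles = (
    toy.groupby("ExperienceLevel")["CompTotal"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```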
Country¶
We'll create a choropleth map to visualize the distribution of data scientists by country and see which countries have the most.
Insight:¶
- Most data scientists are from the United States, followed by India and Germany.
- South America and Africa have the lowest number of data scientists.
import plotly.express as px
country_counts = df_ai['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Count']
fig = px.choropleth(
    country_counts,
    locations='Country',
    locationmode='country names',
    color='Count',
    hover_name='Country',
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Country Chart to Represent the respondents'
)
fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',
    ),
    title={
        'text': 'Country Chart to Represent the respondents',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    margin=dict(l=0, r=0, t=0, b=0)
)
fig.update_coloraxes(colorbar_title="People", colorbar=dict(
    len=0.75,
    thickness=15,
))
fig.show()
Next, we exclude the United States to get a clearer view of how data scientists are distributed across the remaining countries.
df_ai_exclude_us = df_ai[df_ai['Country'] != 'United States of America']
country_counts = df_ai_exclude_us['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Count']
fig = px.choropleth(
    country_counts,
    locations='Country',
    locationmode='country names',
    color='Count',
    hover_name='Country',
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Country Chart to Represent the respondents'
)
fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',  # Adjust projection for better global representation
    ),
    title={
        'text': 'Country Chart to Represent the respondents',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    margin=dict(l=0, r=0, t=50, b=0)  # Adjust margins to center the map
)
fig.update_coloraxes(colorbar_title="People", colorbar=dict(
    len=0.75,  # Adjust length
    thickness=15,  # Adjust thickness
))
fig.show()
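Dropping the United States is one way to keep a single large count from flattening the color scale. Another option, shown here as a sketch rather than as what the notebook does, is to color by the logarithm of the count. The `LogCount` column name and the sample counts are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Invented counts; in the notebook these come from df_ai['Country'].value_counts().
country_counts = pd.DataFrame({
    "Country": ["United States of America", "India", "Germany", "Brazil"],
    "Count": [1000, 100, 10, 1],
})

# log10 compresses the range so countries with small counts remain distinguishable.
country_counts["LogCount"] = np.log10(country_counts["Count"])
print(country_counts)
```

Passing `color='LogCount'` to `px.choropleth` in place of `color='Count'` would then keep every country, including the United States, on one map.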
import plotly.graph_objects as go
continent_counts = df_ai['Continent'].value_counts()
labels = continent_counts.index
sizes = continent_counts.values
fig = go.Figure(data=[go.Pie(labels=labels, values=sizes, hole=.3)])
fig.update_traces(textinfo='percent+label')
fig.update_layout(title_text='Distribution of Entries by Continent')
fig.show()
import plotly.express as px
fig = px.box(df_ai_filtered_2,
             x='Continent',
             y='CompTotal',
             title='Boxplot of Compensation Salary by Continent',
             labels={'CompTotal': 'Compensation Salary', 'Continent': 'Continent'})
fig.show()
import plotly.express as px
fig = px.box(df_ai_filtered_2,
             x='RemoteWork',
             y='CompTotal',
             color='Continent',
             title='Boxplot of Compensation Total by Remote Work Category and Continent',
             labels={'RemoteWork': 'Remote Work Category', 'CompTotal': 'Compensation Total'},
             category_orders={'Continent': df_ai_filtered_2['Continent'].unique()})
fig.update_layout(legend_title_text='Continent')
fig.show()
3. Most popular languages, databases, and platforms among data scientists¶
We will use treemaps to visualize the most popular languages, databases, and platforms among data scientists.
Insights:¶
- Language: The most popular language among data scientists is Python.
- Database: PostgreSQL and MySQL are the most popular databases among data scientists; relational databases dominate.
- Cloud Platform: The most popular cloud platform among data scientists is AWS, followed by Azure and GCP. Databricks is gaining momentum.
- Web Framework: FastAPI is the most popular web framework among data scientists.
- Miscellaneous Tech: Pandas and NumPy are the most relevant in this field.
- Tools: Docker and Pip are the most popular tools.
- Collaboration Tools: VS Code and Jupyter are the most popular collaboration tools.
- Office Stack Async: Jira is the most popular tool.
- Office Stack Sync: Slack and Microsoft Teams are the most popular tools.
- AI usage: ChatGPT, GitHub Copilot, and Gemini are the most popular tools.
def plot_treemap_from_column(df_column):
    """Draw a treemap of how often each ';'-separated value appears in df_column."""
    # Split multi-select answers and produce one row per technology
    technologies = df_column.str.split(';').explode()
    # Count occurrences; missing answers are dropped by value_counts
    tech_counts = technologies.value_counts().reset_index()
    tech_counts.columns = ['Technology', 'Count']
    fig = px.treemap(tech_counts, path=['Technology'], values='Count',
                     color='Count', color_continuous_scale='Viridis')
    fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
    fig.show()
plot_treemap_from_column(df_ai['LanguageHaveWorkedWith'])
plot_treemap_from_column(df_ai['LanguageWantToWorkWith'])
plot_treemap_from_column(df_ai['DatabaseHaveWorkedWith'])
plot_treemap_from_column(df_ai['DatabaseWantToWorkWith'])
plot_treemap_from_column(df_ai['PlatformHaveWorkedWith'])
plot_treemap_from_column(df_ai['PlatformWantToWorkWith'])
plot_treemap_from_column(df_ai['WebframeHaveWorkedWith'])
plot_treemap_from_column(df_ai['WebframeWantToWorkWith'])
plot_treemap_from_column(df_ai['MiscTechHaveWorkedWith'])
plot_treemap_from_column(df_ai['MiscTechWantToWorkWith'])
plot_treemap_from_column(df_ai['ToolsTechHaveWorkedWith'])
plot_treemap_from_column(df_ai['ToolsTechWantToWorkWith'])
plot_treemap_from_column(df_ai['NEWCollabToolsHaveWorkedWith'])
plot_treemap_from_column(df_ai['NEWCollabToolsWantToWorkWith'])
plot_treemap_from_column(df_ai['OfficeStackAsyncHaveWorkedWith'])
plot_treemap_from_column(df_ai['OfficeStackAsyncWantToWorkWith'])
plot_treemap_from_column(df_ai['OfficeStackSyncHaveWorkedWith'])
plot_treemap_from_column(df_ai['OfficeStackSyncWantToWorkWith'])
plot_treemap_from_column(df_ai['AISearchDevHaveWorkedWith'])
plot_treemap_from_column(df_ai['AISearchDevWantToWorkWith'])
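The counting step inside `plot_treemap_from_column` can be seen on a toy column. Survey answers store several technologies in one cell separated by `;`, so each cell is split and exploded into one row per technology before counting. The sample values below are invented.

```python
import pandas as pd

# Invented multi-select answers in the survey's 'A;B;C' format.
answers = pd.Series(["Python;SQL", "Python;R", None, "SQL;Python"])

# Split each cell into a list, then emit one row per list element.
exploded = answers.str.split(";").explode()

# value_counts ignores the missing answer by default.
counts = exploded.value_counts()
print(counts)
```

Because each respondent can select several technologies, the counts sum to more than the number of respondents, so the treemap tiles show mentions rather than people.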